2. Azure Tables Versus Traditional Databases
Usually, the people who have the least trouble moving from their
existing storage systems to Azure’s Table service are the ones who
accept that it is not a typical database system, and don’t expect to
find familiar SQL features.
Let’s take a quick look at how Azure tables compare with
traditional database tables.
2.1. Denormalized data
In a traditional database system, DBAs go to great lengths to
ensure that data isn’t repeated and that it stays consistent, through
a process called normalization. Depending on how much time you have
and how obsessive you are, you can normalize your data to any of
several normal forms. Normalization serves a good purpose: it ensures
data integrity. On the other hand, it hurts performance.
Data stored in one Azure table has no relationship with data
stored in another Azure table. You cannot specify foreign key
constraints to ensure that data in one table stays consistent with
data in another. Instead, you must ensure that your schema is
sufficiently denormalized.
Anyone running a high-performance database system has probably
denormalized parts of that system’s schema, or at least experimented
with doing so. Denormalization means storing multiple copies of data
and achieving data consistency by performing multiple writes. Though
this buys you better performance and flexibility, the onus is now on
you, the developer, to maintain data integrity. If you forget to
update one table, your data becomes inconsistent, and
difficult-to-diagnose errors can show up at the application level.
Also, the extra writes take time and can hurt write performance.
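To make the multiple-writes idea concrete, here is a minimal sketch
using the newer azure-data-tables Python SDK (which postdates this
discussion); the table, key, and property names are made up for
illustration. The same message is written twice so that it can be
queried cheaply by two different keys, and both writes are your
responsibility:

from azure.data.tables import TableServiceClient

# "UseDevelopmentStorage=true" points at the local storage emulator (Azurite).
service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
by_user = service.create_table_if_not_exists("MessagesByUser")
by_forum = service.create_table_if_not_exists("MessagesByForum")

message = {"Author": "alice", "Forum": "azure", "Body": "Hello, tables!"}

# Copy 1: partitioned by author, so "all messages by alice" is a single
# partition scan.
by_user.create_entity({"PartitionKey": message["Author"], "RowKey": "msg-001", **message})

# Copy 2: partitioned by forum, so "all messages in azure" is also cheap.
# Forgetting this second write is exactly how the copies drift apart.
by_forum.create_entity({"PartitionKey": message["Forum"], "RowKey": "msg-001", **message})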
Note: In forums, in books, and in the blogosphere, you may see some
“experts” recommending that people denormalize to improve
performance, usually without taking the trouble to explain why it
helps. The reason is simple: data in different tables is typically
stored in different files on disk, or sometimes on different
machines. Normalization implies database joins, which must read
multiple tables from multiple places into memory, and that hurts
performance. One of the reasons Azure tables provide good
performance is that the data is denormalized by default.
If you’re willing to live with very short periods of
inconsistency, you can perform the extra writes asynchronously or
hand them to a worker process, as the sketch below shows. Some
inconsistency is inevitable at this scale: all major Web 2.0 sites
(such as Flickr) frequently run tools that check for data consistency
issues and fix them.
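Here is one way that hand-off might look, sketched with the
azure-storage-queue Python SDK; the queue name and message shape are
hypothetical. The front end enqueues a description of the deferred
write, and a worker applies it later:

import json
from azure.storage.queue import QueueClient

# Hypothetical queue of "please write this copy too" work items
# (assumes the queue has already been created).
queue = QueueClient.from_connection_string(
    "UseDevelopmentStorage=true", queue_name="pending-writes")

# Front end: do the primary write synchronously (not shown), then enqueue
# the secondary write instead of performing it inline.
queue.send_message(json.dumps({
    "table": "MessagesByForum",
    "entity": {"PartitionKey": "azure", "RowKey": "msg-001", "Author": "alice"},
}))

# Worker process: drain the queue and apply the deferred writes.
for msg in queue.receive_messages():
    work = json.loads(msg.content)
    # ... perform the table write described by `work` ...
    queue.delete_message(msg)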
2.2. No schema
Several advantages come with a fixed schema for all your data. A
schema can act as a safety net, trading flexibility for safety: any
errors in your code where you mismatch data types can be caught
early. However, this same safety net gets in your way when you have
semistructured data, because changing a table’s structure to add or
change columns is difficult, and sometimes impossible (ALTER TABLE is
the stuff of nightmares).
Azure tables have no schema. Entities in the same table can have
completely different properties, or different numbers of properties,
as the example below shows. As with denormalization, the onus is on
the developer to ensure that updates reflect the correct schema.
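A quick sketch with the newer azure-data-tables Python SDK (the
entity contents are made up): two entities with entirely different
properties live happily in one table, and nothing in the service
checks them against a schema.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
inventory = service.create_table_if_not_exists("Inventory")

# A book and a camera in the same table, each with its own set of properties.
inventory.create_entity({
    "PartitionKey": "books", "RowKey": "b-001",
    "Title": "Some Book", "Pages": 368,
})
inventory.create_entity({
    "PartitionKey": "cameras", "RowKey": "c-001",
    "Megapixels": 12.1, "HasFlash": True,
})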
2.3. No distributed transactions
If you are accustomed to using transactions to maintain consistency and integrity, the
idea of not having any transactions can be scary. However, in any
distributed storage system, transactions across machines hurt
performance. As with normalization, the onus is on the developer to
maintain consistency and run scripts to ensure that.
This is not as scary or as difficult as it sounds. Large
services such as Facebook and Flickr have long eschewed transactions
as they scaled out. It’s a fundamental trade-off that you make with
cloud-based storage systems.
Though distributed transactions aren’t available, Windows Azure
tables do support “entity group transactions,” which let you batch
requests for entities in the same partition.
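In the newer azure-data-tables Python SDK, an entity group
transaction is a submit_transaction call (a sketch; the table and
property names are illustrative). Every operation in the batch must
target the same PartitionKey, and the batch commits or fails as a
unit:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
orders = service.create_table_if_not_exists("Orders")

# Both operations live in the partition "alice", so they can be batched;
# mixing partition keys in one batch raises an error.
orders.submit_transaction([
    ("create", {"PartitionKey": "alice", "RowKey": "order-001", "Total": 20.0}),
    ("create", {"PartitionKey": "alice", "RowKey": "order-002", "Total": 35.5}),
])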
2.4. Black box
If you’ve ever run a database for a service, you’ve probably mucked with its
configuration. The first thing a lot of developers do when setting up
MySQL is dive into my.cnf and
tune the various knobs that are available. Entire books have been
written on tuning indexes, picking storage engines, and optimizing
query plans.
The Azure Table service does not give you individual knobs for
that level of tuning. Since it is a large distributed system, the
service tunes itself automatically based on data, workload, traffic,
and various other factors. The only thing you control is how your
data is partitioned (which will be addressed shortly). This lack of
knobs can be a blessing, since the system takes care of tuning for
you.
2.5. Row size limits
An entity can hold at most 1 MB of data, a limit that includes
the names of your properties. If you are used to sticking large
pieces of data in each row (a questionable practice in itself), you
could hit this limit easily. In such cases, the right thing to do is
to store the data in the blob storage service and keep a pointer to
the blob in the entity, as sketched below. This is similar to storing
large pieces of data in the filesystem, and having the database
maintain a pointer to the specific file.
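Here is what that pointer pattern might look like, pairing the newer
azure-storage-blob and azure-data-tables Python SDKs; the container,
table, and property names are made up:

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

conn = "UseDevelopmentStorage=true"

# Store the large payload in blob storage...
blobs = BlobServiceClient.from_connection_string(conn)
try:
    blobs.create_container("photos")
except ResourceExistsError:
    pass  # container is already there
blob = blobs.get_blob_client(container="photos", blob="photo-42.jpg")
blob.upload_blob(b"...several megabytes of image data...", overwrite=True)

# ...and keep only a small pointer to it in the table entity.
tables = TableServiceClient.from_connection_string(conn)
photos = tables.create_table_if_not_exists("PhotoIndex")
photos.create_entity({
    "PartitionKey": "alice", "RowKey": "photo-42",
    "Caption": "My cat", "BlobUrl": blob.url,
})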
2.6. Lack of support for familiar tools
Like other cloud storage systems, the Azure Table service is
pretty nascent. This means the ecosystem around it is nascent, too.
Tools you’re comfortable with for SQL Server, Oracle, or MySQL mostly
won’t work with Azure tables, and replacements can be hard to find.
This is a problem that will be solved over time as more
people adopt the service.